3.6 James-Stein Estimator

1 Gaussian Sequence Model

Recall the Gaussian sequence model $X \sim N_d(\theta, I_d)$. The goal is to estimate $\theta \in \mathbb{R}^d$ via $\delta(X)$ with low MSE: $\mathrm{MSE}(\theta; \delta) = \mathbb{E}_\theta \|\theta - \delta(X)\|^2$.
The model is more general than it appears. For instance, if $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} N_d(\theta, \sigma^2 I_d)$ for known $\sigma^2 > 0$, we could make a sufficiency reduction to obtain $Z = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^n X_i \sim N_d\!\left(\frac{\sqrt{n}}{\sigma}\theta, I_d\right)$, which is the sequence model with a rescaled mean parameter.

1.1 Bayes Estimators

If we introduce the Bayesian prior $\theta_i \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)$, then the Bayes estimator is $\frac{\tau^2}{1+\tau^2} X$.
We can view this as a special case of the generic linear shrinkage estimator $\delta_\zeta(X) = (1-\zeta)X$, where $\zeta \in [0,1]$ is a tuning parameter we will call the shrinkage parameter. Taking $\zeta = (1+\tau^2)^{-1}$ recovers the Bayes estimator.
If we aren't sure which $\zeta$ to use (e.g., we have a priori uncertainty about $\tau^2$), we can use hierarchical Bayes: $\delta(X) = (1 - \mathbb{E}[\zeta \mid X])X = \delta_{\hat\zeta_{\text{Bayes}}(X)}(X)$, so we are in effect estimating $\zeta$ from the whole data set and then plugging it in as a data-adaptive tuning parameter.

If $d \geq 3$, then the UMVUE for $\zeta$ is $\hat\zeta_{\text{UMVU}}(X) = \frac{d-2}{\|X\|^2}$, based on the fact that $Y \sim \chi^2_d = \mathrm{Gamma}\!\left(\frac{d}{2}, 2\right)$ with $d > 2$ implies $\mathbb{E}\!\left[\frac{1}{Y}\right] = \frac{1}{d-2}$.
Plugging in $\hat\zeta_{\text{UMVU}}$ results in an estimator called the James-Stein estimator: $\delta_{\text{JS}}(X) = \left(1 - \frac{d-2}{\|X\|^2}\right)X = \delta_{\hat\zeta_{\text{UMVU}}(X)}(X)$.
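As a quick numerical sketch (not part of the notes), the James-Stein estimator can be compared against the unbiased estimator $\delta(X) = X$ by Monte Carlo; the dimension $d$, the true mean $\theta$, and the seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rep = 10, 20000            # dimension and number of Monte Carlo replications
theta = np.full(d, 0.5)         # arbitrary true mean vector

X = rng.normal(theta, 1.0, size=(n_rep, d))       # draws of X ~ N_d(theta, I_d)
shrink = 1 - (d - 2) / np.sum(X**2, axis=1)       # factor 1 - (d-2)/||X||^2
delta_js = shrink[:, None] * X                    # James-Stein estimates

mse_mle = np.mean(np.sum((X - theta)**2, axis=1))        # risk of delta(X) = X, about d
mse_js = np.mean(np.sum((delta_js - theta)**2, axis=1))  # risk of James-Stein
print(mse_mle, mse_js)          # James-Stein should come out strictly smaller
```

For this $\theta$ the James-Stein risk comes out well below $d$, illustrating the dominance result discussed next.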

1.2 James-Stein Paradox

While the James-Stein estimator can be motivated as an empirical Bayes estimator, it performs well even without making any Bayesian assumptions at all.

For $d \geq 3$, the estimator $X$ is actually inadmissible as an estimator of $\theta$: $\mathrm{MSE}(\theta, \delta_{\text{JS}}) < \mathrm{MSE}(\theta, X)$ for all $\theta \in \mathbb{R}^d$.

This is surprising because $\delta_{\text{JS}}$ beats the UMVUE not just on average, but at every fixed value of $\theta$.

In fact, we can use an estimator shrinking towards any $\theta_0 \in \mathbb{R}^d$: $\tilde\delta(X) = \theta_0 + \left(1 - \frac{d-2}{\|X - \theta_0\|^2}\right)(X - \theta_0)$. This also dominates $\delta_0(X) = X$, because it is just the James-Stein estimator we would get by substituting $Y = X - \theta_0 \sim N_d(\mu, I_d)$ with $\mu = \theta - \theta_0$.
By the translation invariance of the Gaussian location model, the James-Stein estimator for $\mu$ based on $Y$ dominates $\hat\mu_0(Y) = Y$; translating back, $\tilde\delta(X) = \delta_{\text{JS}}(Y) + \theta_0$ dominates $\delta_0(X) = \hat\mu_0(Y) + \theta_0 = X$ as an estimator of $\theta$.

1.3 Linear Shrinkage Estimators

Even without introducing a Bayesian prior for $\theta$, we can motivate our linear shrinkage estimator purely from the perspective of trading bias for a reduction in variance. Calculating the MSE coordinate-wise,
$$\mathbb{E}_\theta[(\theta_i - \delta_i(X))^2] = \left(\theta_i - \mathbb{E}_\theta(1-\zeta)X_i\right)^2 + \mathrm{Var}_\theta\!\left((1-\zeta)X_i\right) = (\zeta\theta_i)^2 + (1-\zeta)^2, \quad \text{so} \quad \mathrm{MSE}(\theta; \delta_\zeta) = \zeta^2\|\theta\|^2 + d(1-\zeta)^2.$$
Setting $0 = \frac{d}{d\zeta}\mathrm{MSE}(\theta; \delta_\zeta) = 2\zeta\|\theta\|^2 - 2(1-\zeta)d$ gives $\zeta^*(\theta) = \frac{d}{d + \|\theta\|^2}$.
This looks similar to $\frac{1}{1+\tau^2}$ (the Bayes-optimal $\zeta$ under the Gaussian prior): under the prior, $\mathbb{E}\|\theta\|^2 = d\tau^2$, and substituting this gives $\frac{d}{d + d\tau^2} = \frac{1}{1+\tau^2}$.
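A small sanity check on this derivation (an illustration with an arbitrarily chosen $d$ and $\theta$): minimize $\zeta^2\|\theta\|^2 + d(1-\zeta)^2$ over a fine grid and compare the grid minimizer to the closed form $d/(d + \|\theta\|^2)$.

```python
import numpy as np

d = 5
theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # arbitrary mean vector
norm2 = np.sum(theta**2)                        # ||theta||^2

zeta_grid = np.linspace(0, 1, 100001)
mse = zeta_grid**2 * norm2 + d * (1 - zeta_grid)**2   # MSE(theta; delta_zeta)
zeta_star = d / (d + norm2)                            # closed-form minimizer

print(zeta_grid[np.argmin(mse)], zeta_star)            # the two should agree closely
```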

2 SURE

Theorem (Stein's Lemma)

Suppose $X \sim N(\theta, \sigma^2)$ and $h: \mathbb{R} \to \mathbb{R}$ is differentiable, with $\mathbb{E}|\dot h(X)| < \infty$. Then
$$\mathrm{Cov}(X, h(X)) = \mathbb{E}[(X - \theta)h(X)] = \sigma^2\,\mathbb{E}[\dot h(X)]. \tag{2.1}$$
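The identity (2.1) can be checked by Monte Carlo; the test function $h(x) = x^3$ (so $\dot h(x) = 3x^2$), the parameters, and the seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 1.0, 2.0
X = rng.normal(theta, sigma, size=2_000_000)

h = X**3                                  # h(x) = x^3, with derivative 3x^2
lhs = np.mean((X - theta) * h)            # estimates E[(X - theta) h(X)]
rhs = sigma**2 * np.mean(3 * X**2)        # estimates sigma^2 E[h'(X)]
print(lhs, rhs)                           # the two sides should agree closely
```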

Now consider the multivariate version. For a function $h: \mathbb{R}^d \to \mathbb{R}^d$, define the Jacobian matrix $Dh \in \mathbb{R}^{d \times d}$ by $(Dh(x))_{ij} = \frac{\partial h_i}{\partial x_j}(x)$, and the Frobenius norm of $A \in \mathbb{R}^{d \times d}$ as $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.

Theorem (Stein's Lemma, Multivariate)

Assume $X \sim N_d(\theta, \sigma^2 I_d)$ and $h: \mathbb{R}^d \to \mathbb{R}^d$ is differentiable with $\mathbb{E}\|Dh(X)\|_F < \infty$. Then
$$\mathbb{E}[(X - \theta)^T h(X)] = \sigma^2\,\mathbb{E}\,\mathrm{tr}(Dh(X)) = \sigma^2 \sum_{i=1}^d \mathbb{E}\,\frac{\partial h_i}{\partial x_i}(X). \tag{2.2}$$

Apply Stein's lemma to $h(x) = x - \delta(x)$ (note that we now assume $X \sim N_d(\theta, \sigma^2 I_d)$):
$$\mathrm{MSE}(\theta; \delta) = \mathbb{E}_\theta\|\delta(X) - \theta\|^2 = \mathbb{E}_\theta\|X - h(X) - \theta\|^2 = \mathbb{E}_\theta\|X - \theta\|^2 + \mathbb{E}_\theta\|h(X)\|^2 - 2\,\mathbb{E}_\theta[(X - \theta)^T h(X)] = \sigma^2 d + \mathbb{E}_\theta\|h(X)\|^2 - 2\sigma^2\,\mathbb{E}_\theta\,\mathrm{tr}(Dh(X)).$$
So if $\sigma^2$ is known, we obtain the unbiased estimator
$$\widehat{\mathrm{MSE}}(X) = \sigma^2 d + \|h(X)\|^2 - 2\sigma^2\,\mathrm{tr}(Dh(X)). \tag{2.3}$$
We call it Stein's Unbiased Risk Estimator (SURE).
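As a sketch of SURE in action (not part of the notes, with arbitrary $d$, $\theta$, and seed): for the James-Stein estimator with $\sigma^2 = 1$, $h(x) = \frac{d-2}{\|x\|^2}x$, and a short calculation gives $\|h(X)\|^2 = \mathrm{tr}\,Dh(X) = \frac{(d-2)^2}{\|X\|^2}$, so (2.3) reduces to $\widehat{\mathrm{MSE}}(X) = d - \frac{(d-2)^2}{\|X\|^2}$. We can check unbiasedness by comparing its average to the Monte Carlo risk.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_rep = 10, 50000
theta = np.linspace(-1, 1, d)                 # arbitrary true mean vector

X = rng.normal(theta, 1.0, size=(n_rep, d))   # draws of X ~ N_d(theta, I_d)
norm2 = np.sum(X**2, axis=1)                  # ||X||^2 for each replication

delta_js = (1 - (d - 2) / norm2)[:, None] * X
true_loss = np.mean(np.sum((delta_js - theta)**2, axis=1))   # Monte Carlo MSE

# SURE for James-Stein with sigma^2 = 1: h(x) = (d-2) x / ||x||^2 gives
# ||h(X)||^2 = tr Dh(X) = (d-2)^2 / ||X||^2, hence SURE = d - (d-2)^2 / ||X||^2
sure = np.mean(d - (d - 2)**2 / norm2)
print(true_loss, sure)                        # the two averages should agree
```

Note that SURE here depends on $X$ alone, not on the unknown $\theta$, which is what makes it usable for data-driven tuning.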

3 Risk of the James-Stein Estimator

Now we calculate the risk of the James-Stein estimator $\delta_{\text{JS}}(X) = \left(1 - \frac{d-2}{\|X\|^2}\right)X$.